Convert build process to saga framework#842
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (16)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (10)
📝 WalkthroughWalkthroughThis PR adds a saga-orchestrated build pipeline: StreamRegistry and StatusRegistry for staging tar streams and streaming build status; refactors Builder.buildFromDir to use detectBuildStack and runBuildkitBuild; defines a restartable build-from-tar saga with compensating undo actions; provides a SagaBuilder RPC wrapper that manages streams/status and populates results; adds tests for sagas, registries, storage conformance, and coordinator wiring; and conditionally enables the saga builder via the sagas feature flag. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
servers/build/build_saga_test.go (2)
109-124: ⚡ Quick winMake stub artifact names unique per run.
Using a fixed
<app>-stubentity name makes this helper fragile if a test starts the saga more than once against the same in-memory store.♻️ Proposed fix
- artifactName := in.AppName + "-stub" + artifactName := in.AppName + "-" + in.VersionName + "-stub"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@servers/build/build_saga_test.go` around lines 109 - 124, The helper currently uses a fixed artifactName ("artifactName := in.AppName + \"-stub\"") which causes collisions when the saga is started multiple times against the same in-memory store; change artifactName generation in the build helper to include a run-unique suffix (e.g., append time.Now().UnixNano() or a generated UUID) before calling deps.builder.ec.Create so each created entity name is unique and avoids test interference with artifact creation via deps.builder.ec.Create.
293-309: ⚡ Quick winAssert compensation by checking entity absence, not only
UndoneAt.This currently verifies saga metadata only. Add direct store checks for the created
ConfigVersion/AppVersionentities to ensure compensators actually removed persisted rows.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@servers/build/build_saga_test.go` around lines 293 - 309, The test currently only checks exec.ExecutedActions[actionCreateConfigVer|actionCreateVersion].UndoneAt; extend it to assert the compensators actually removed persisted rows by attempting to load the corresponding entities (ConfigVersion and AppVersion) from the store and expecting a "not found" result. After you retrieve exec from storage.Get, derive the created entity IDs from the execution record or action payload, then call the store read methods for ConfigVersion and AppVersion (use the project's existing load/get functions) and assert they return a not-found error (or nil result) rather than a valid entity; keep these checks adjacent to the existing UndoneAt assertions for actionCreateConfigVer and actionCreateVersion.servers/build/build_saga.go (1)
659-689: 💤 Low valueMinor: Unused variable workaround can be removed when migration check is updated.
Line 675 has
_ = statusto suppress the unused variable warning. The comment at lines 667-672 explains this is a temporary regression. Consider tracking this as technical debt.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@servers/build/build_saga.go` around lines 659 - 689, The _ = status no-op exists to silence an unused variable in finalize; remove this workaround (delete the `_ = status` line) and either use status where intended or add a short TODO comment referencing the pending migration work so the linter won't be suppressed silently; locate this in the finalize function (symbols: finalize, status, checkLocalStorageMigration) and ensure no other unused-variable warnings remain after removal.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@servers/build/stream_registry.go`:
- Around line 140-152: The cleanup currently deletes s.staged[streamID] and
s.streams[streamID] while holding s.mu before calling os.RemoveAll, so if
RemoveAll fails the registry loses the staged path; change the flow in the
Cleanup method to first read (but not delete) the staged path under s.mu,
release the lock, call os.RemoveAll(path), then re-acquire s.mu and only delete
s.staged[streamID] and s.streams[streamID] if removal succeeded (or restore/keep
the entry on error). Use the existing s.mu, s.staged, s.streams, and
os.RemoveAll references to locate and modify the logic accordingly.
---
Nitpick comments:
In `@servers/build/build_saga_test.go`:
- Around line 109-124: The helper currently uses a fixed artifactName
("artifactName := in.AppName + \"-stub\"") which causes collisions when the saga
is started multiple times against the same in-memory store; change artifactName
generation in the build helper to include a run-unique suffix (e.g., append
time.Now().UnixNano() or a generated UUID) before calling deps.builder.ec.Create
so each created entity name is unique and avoids test interference with artifact
creation via deps.builder.ec.Create.
- Around line 293-309: The test currently only checks
exec.ExecutedActions[actionCreateConfigVer|actionCreateVersion].UndoneAt; extend
it to assert the compensators actually removed persisted rows by attempting to
load the corresponding entities (ConfigVersion and AppVersion) from the store
and expecting a "not found" result. After you retrieve exec from storage.Get,
derive the created entity IDs from the execution record or action payload, then
call the store read methods for ConfigVersion and AppVersion (use the project's
existing load/get functions) and assert they return a not-found error (or nil
result) rather than a valid entity; keep these checks adjacent to the existing
UndoneAt assertions for actionCreateConfigVer and actionCreateVersion.
In `@servers/build/build_saga.go`:
- Around line 659-689: The _ = status no-op exists to silence an unused variable
in finalize; remove this workaround (delete the `_ = status` line) and either
use status where intended or add a short TODO comment referencing the pending
migration work so the linter won't be suppressed silently; locate this in the
finalize function (symbols: finalize, status, checkLocalStorageMigration) and
ensure no other unused-variable warnings remain after removal.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 23e291f7-2359-4312-9c88-01f1636a9838
📒 Files selected for processing (11)
components/coordinate/coordinate.gopkg/labs/features.yamlpkg/labs/labs.gen.goservers/build/build.goservers/build/build_saga.goservers/build/build_saga_test.goservers/build/saga_builder.goservers/build/status_registry.goservers/build/status_registry_test.goservers/build/stream_registry.goservers/build/stream_registry_test.go
ccf6430 to
a435c56
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@servers/build/stream_registry.go`:
- Around line 85-96: The code deletes s.streams[streamID] before calling
os.MkdirTemp, so if MkdirTemp fails the unread reader is lost; change Stage so
that either (a) you create the temp dir before deleting the entry, or (b) if you
must delete first, on MkdirTemp error reinsert the saved reader back into
s.streams under s.mu to restore state. Concretely: keep the local reader
variable, call os.MkdirTemp, and only delete s.streams[streamID] after
successful dir creation; or if you keep the delete earlier, surround the
MkdirTemp call and on error acquire s.mu and set s.streams[streamID] = reader
before returning the error (use s.mu to protect the map).
- Around line 128-133: MarkStaged currently overwrites any existing staged path
for a streamID, losing the original prepared directory; change
StreamRegistry.MarkStaged so it preserves the first staged path by checking
s.staged for streamID and only setting s.staged[streamID] = path if it does not
already exist, while still removing the entry from s.streams; reference
StreamRegistry.MarkStaged (and behavior consistent with Register/Cleanup) to
locate the method to update.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6862bc7d-155d-4a57-9af9-2bdefc27e080
📒 Files selected for processing (11)
components/coordinate/coordinate.gopkg/labs/features.yamlpkg/labs/labs.gen.goservers/build/build.goservers/build/build_saga.goservers/build/build_saga_test.goservers/build/saga_builder.goservers/build/status_registry.goservers/build/status_registry_test.goservers/build/stream_registry.goservers/build/stream_registry_test.go
✅ Files skipped from review due to trivial changes (2)
- pkg/labs/labs.gen.go
- pkg/labs/features.yaml
🚧 Files skipped from review as they are similar to previous changes (8)
- components/coordinate/coordinate.go
- servers/build/status_registry_test.go
- servers/build/build_saga_test.go
- servers/build/build_saga.go
- servers/build/saga_builder.go
- servers/build/stream_registry_test.go
- servers/build/status_registry.go
- servers/build/build.go
8f513b0 to
e79d406
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
servers/build/build_saga.go (1)
438-442: 💤 Low valueConsider logging a warning if EphemeralExpiresAt fails to parse.
The parse error is silently ignored. While this value comes from
handleEphemeralwhich formats with RFC3339 (so it should never fail), silent failures could mask data corruption issues where ephemeral versions never expire.Suggested enhancement
if in.EphemeralExpiresAt != "" { - if t, err := time.Parse(time.RFC3339, in.EphemeralExpiresAt); err == nil { + t, err := time.Parse(time.RFC3339, in.EphemeralExpiresAt) + if err != nil { + b.Log.Warn("failed to parse ephemeral expiration time", "value", in.EphemeralExpiresAt, "error", err) + } else { av.EphemeralExpiresAt = t } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@servers/build/build_saga.go` around lines 438 - 442, The parse error for in.EphemeralExpiresAt is currently ignored; update the time.Parse branch to log a warning when parsing fails (include the offending in.EphemeralExpiresAt string and the parse error) while preserving current behavior when parse succeeds (set av.EphemeralExpiresAt = t). Use the package's logger (or the standard log package) to emit the warning from the same context where in.EphemeralExpiresAt and av.EphemeralExpiresAt are handled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@servers/build/build_saga.go`:
- Around line 438-442: The parse error for in.EphemeralExpiresAt is currently
ignored; update the time.Parse branch to log a warning when parsing fails
(include the offending in.EphemeralExpiresAt string and the parse error) while
preserving current behavior when parse succeeds (set av.EphemeralExpiresAt = t).
Use the package's logger (or the standard log package) to emit the warning from
the same context where in.EphemeralExpiresAt and av.EphemeralExpiresAt are
handled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2d60c1b9-1aa4-4a0e-a8ac-69b5d4d9f694
📒 Files selected for processing (12)
components/coordinate/coordinate.gopkg/labs/features.yamlpkg/labs/labs.gen.goservers/build/build.goservers/build/build_saga.goservers/build/build_saga_buildkit.goservers/build/build_saga_test.goservers/build/saga_builder.goservers/build/status_registry.goservers/build/status_registry_test.goservers/build/stream_registry.goservers/build/stream_registry_test.go
✅ Files skipped from review due to trivial changes (2)
- pkg/labs/features.yaml
- pkg/labs/labs.gen.go
🚧 Files skipped from review as they are similar to previous changes (8)
- servers/build/status_registry.go
- components/coordinate/coordinate.go
- servers/build/status_registry_test.go
- servers/build/stream_registry_test.go
- servers/build/build.go
- servers/build/saga_builder.go
- servers/build/build_saga_test.go
- servers/build/stream_registry.go
The build pipeline (BuildFromTar / BuildFromPrepared / buildFromDir) ran as a 450-line synchronous function whose only crash recovery was defer-based cleanup that died with the process. A server restart mid build could leave orphaned ConfigVersion or AppVersion entities, with no way to either resume the build or compensate the partial state. Converted to the saga framework from MIR-439, following the sandbox saga pattern from MIR-440. Each phase of the build becomes a saga action with an undo; a new StreamRegistry stages incoming tar data to disk so recovery survives the loss of the original io.Reader, and a symmetric StatusRegistry handles outbound progress streaming without coupling actions to the per-request RPC stream. SagaBuilder implements the Builder RPC interface and coordinate.go selects it behind the existing labs.Sagas() flag. Closes MIR-441
e79d406 to
a0d23d4
Compare
Converting the build process to sagas (the parent commit) surfaced two
framework bugs that only bite a saga which reads its own state back or
shares storage with another executor. The build saga does both.
EntityStorage.Save and EACStorage.Save persisted through create-if-
absent primitives (EnsureEntity / EAC Ensure), so every save after the
first was silently dropped. A saga's record froze at its initial
pending state: status never advanced to completed, no action results
were recorded. Reading outputs back failed ("status pending, expected
completed"), and on restart recovery re-ran already-finished sagas
because they still looked pending. MemoryStorage overwrites, so unit
tests stayed green while both production backends were broken. Save now
upserts: EntityStorage ensures-then-replaces, EACStorage uses Put.
Recovery treated a saga whose definition isn't in the executor's
registry as an error. But storage is shared across executors (build and
sandbox each run their own) and ListIncomplete returns every executor's
sagas, so each tripped over the other's on startup. Recovery now skips
definitions it doesn't own, leaving them to their owning executor.
A storage conformance suite runs the persistence scenarios against all
three Storage backends so a backend that silently drops updates can't
pass again.
| @@ -0,0 +1,169 @@ | |||
| package saga | |||
There was a problem hiding this comment.
We picked this up in another PR already I'm pretty sure.
| // prepareConfig assembles the final ConfigSpec for the new version by | ||
| // merging build outputs, app.toml, Procfile, and existing app config, | ||
| // then runs every blocking validation (services exist, required vars | ||
| // have values, node ports are free, disk references resolve). Pure | ||
| // computation + entity reads — no side effects, no undo needed. | ||
| // | ||
| // Validation failures surface as the saga error and a user-facing | ||
| // status update. The pre-saga path called validateNodePorts and | ||
| // validateDiskConfigs separately; bundling them with the rest in one | ||
| // action keeps the saga DAG simple and matches what the user | ||
| // experiences as one logical "config prep" step. | ||
|
|
||
| type prepareConfigIn struct { | ||
| AppName string `json:"app_name" saga:"app_name"` | ||
| StreamID string `json:"stream_id" saga:"stream_id"` | ||
| AppID string `json:"app_id" saga:"app_id"` | ||
| BuildResult *BuildResult `json:"build_result,omitempty" saga:"build_result,optional"` | ||
| AppConfig *appconfig.AppConfig `json:"app_config,omitempty" saga:"app_config,optional"` | ||
| ProcfileServices map[string]string `json:"procfile_services,omitempty" saga:"procfile_services,optional"` | ||
| ExistingConfig string `json:"existing_config_json" saga:"existing_config_json"` | ||
| CLIEnvVars []*build_v1alpha.EnvironmentVariable `json:"cli_env_vars,omitempty" saga:"cli_env_vars,optional"` | ||
| } | ||
|
|
||
| type prepareConfigOut struct { | ||
| ConfigSpec string `json:"config_spec_json" saga:"config_spec_json"` | ||
| } | ||
|
|
||
| func prepareConfig(ctx context.Context, in prepareConfigIn) (prepareConfigOut, error) { |
There was a problem hiding this comment.
meta-point: man these are so good for cleaning up the logic. It's immediately obvious what can influence the config.
| func undoSetActiveVersion(ctx context.Context, in setActiveVersionIn, out setActiveVersionOut) error { | ||
| if out.Skipped { | ||
| return nil | ||
| } | ||
| deps := saga.Get[*buildSagaDeps](ctx) | ||
| if err := deps.builder.appClient.SetActiveVersion(ctx, in.AppName, out.PreviousVersionID); err != nil { | ||
| return fmt.Errorf("restoring previous active version on %s: %w", in.AppName, err) | ||
| } | ||
| return nil | ||
| } |
| // StreamRegistry bridges non-serializable tar streams into the saga world by | ||
| // staging them to durable filesystem paths. The pattern, per RFD-35: |
There was a problem hiding this comment.
This is a good abstraction because I could see us eventually having the ability to consume the tar from other places and this makes that easy.

This converts the build pipeline (
BuildFromTar/BuildFromPrepared) from a ~450-line synchronous function into the saga framework from MIR-439. Eleven actions cover the pipeline, from receiving the source tar through building the image, creating the config and version entities, provisioning addons, and activating the new version, each with a compensating undo so a failure partway through rolls back cleanly instead of leaving half-built state behind.Two small primitives bridge the saga world (which needs everything serializable for crash recovery) to the non-serializable RPC stream state.
StreamRegistrystages the inbound tar to disk so a recovered saga can still find the source after the originalio.Readeris gone, andStatusRegistrycarries the outbound progress stream, handing back a no-op sender when the saga is recovering and the client is long gone. ASagaBuilderwraps the whole thing as a drop-in implementation of the existing Builder RPC interface, andcoordinate.goselects it behind the existinglabs.Sagas()flag.The interesting part is what manual testing turned up. Running a real deploy through the saga path failed at the very end with "execution has status pending, expected completed," even though the build had clearly succeeded: the saga's entire state had frozen at its initial pending write. The cause was in the saga framework itself, not the build code.
EntityStorage.SaveandEACStorage.Savepersisted through create-if-absent primitives, so every save after the first was silently dropped. The sandbox saga from MIR-439 had been hitting this the whole time, it just never noticed because nothing read its state back. The fix makes Save a real upsert. A second, smaller bug fell out of the same testing: each executor's recovery treated another executor's sagas (build and sandbox share storage) as errors, so we scoped recovery to skip definitions an executor doesn't own.To keep both from regressing, there's a storage conformance suite that runs the persistence scenarios against all three Storage backends. The deeper question this raised, whether the mock store can be trusted to stand in for etcd in the first place, is addressed separately in the entity.Store conformance suite (#847).